Search CORE

827 research outputs found

Robust Estimation with Discrete Explanatory Variables

Author: FR Hampel
M Hubert
M Orhan
P Čížek
PJ Rousseeuw
Publication venue: Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät
Publication date: 01/01/2002

Crossref

Dokumenten-Publikationsserver der Humboldt-Universität zu Berlin

A Fast Algorithm for Robust Regression with Penalised Trimmed Squares

Author: A Giloni
AC Atkinson
AS Hadi
C Agostinelli
CW Coakley
D Gervini
D Peña
DM Hawkins
DM Sebert
G Zioutas
J Agulló
JF Gentleman
L. Pitsoulis
LM Li
LS Pitsoulis
M Salibian-Barrera
MS Bazaraa
N Billor
O Hössjer
PJ Rousseeuw
RJ Rousseeuw
TA Feo
VJ Yohai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Groups of high leverage outliers make linear regression a difficult problem because of the masking effect. The available high breakdown estimators based on Least Trimmed Squares (LTS) often fail to detect masked high leverage outliers in finite samples. An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem whose objective function includes a penalty cost for each observation, which serves as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency than other estimators. However, because of the high computational complexity of the resulting QMIP problem, computing exact solutions for moderately large regression problems is infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as high breakdown and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator for large data sets efficiently. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination indicate that the proposed algorithm performs well in identifying groups of high leverage outliers in reasonable computational time. Comment: 27 pages
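
As a rough illustration of the kind of objective this abstract describes, a penalised trimmed squares criterion can be sketched as below. The notation is an assumption made here and is not taken from the paper: $\delta_i = 1$ marks observation $i$ as deleted, $p_i > 0$ is its penalty cost, and $r_i(\beta)$ is its residual. The paper's exact penalty definition and the constraints of its QMIP formulation are not given in the abstract and are not reproduced.

\[
\min_{\beta,\ \delta \in \{0,1\}^n} \ \sum_{i=1}^{n} \Bigl[ (1-\delta_i)\, r_i(\beta)^2 + \delta_i\, p_i \Bigr],
\qquad r_i(\beta) = y_i - x_i^{\top}\beta .
\]

For a fixed $\beta$, the best choice is $\delta_i = 1$ exactly when $r_i(\beta)^2 > p_i$, so the objective equals $\sum_i \min\{ r_i(\beta)^2,\ p_i \}$. Under this sketch each penalty acts as a cap on the squared residual an observation can contribute, and the number of deleted observations is decided by the data rather than fixed in advance.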

arXiv.org e-Print Archive

CiteSeerX

Crossref

Graph-Embedding Empowered Entity Retrieval

Author: D Metzler
DL Davies
K Balog
L McInnes
N Jardine
N Noy
PJ Rousseeuw
S Robertson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/05/2020

In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings yields a larger gain in the effectiveness of entity retrieval results than using plain word embeddings. We analyze the impact of the accuracy of the entity linker on the overall retrieval effectiveness. Our analysis further uses the cluster hypothesis to explain the observed advantages of graph embeddings over the more widely used word embeddings for user tasks involving ranking entities.

arXiv.org e-Print Archive

Crossref

Crowdsourcing Dialect Characterization through Twitter

Author: Bruno Gonçalves
D Mocanu
David Sánchez
DT Pham
J Borge-Holthoefer
M Salathé
PJ Rousseeuw
Tobias Preis
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 26/07/2014

We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Comment: 10 pages, 5 figures

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

HAL AMU

Directory of Open Access Journals

PubMed Central

Digital.CSIC

Robust artificial neural networks and outlier detection. Technical report

Author: Andrei Kelarev
Cederman D
Gleb Beliakov
Huber PJ
John Yearwood
Makela MM
Mammadov MA
Masters T
Powell MJD
Press AH
Rousseeuw PJ
Rusiecki A
Sengupta S
Smola AJ
Publication venue: 'Informa UK Limited'
Publication date: 02/10/2011

Large outliers break down linear and nonlinear regression models. Robust regression methods allow one to filter out the outliers when building a model. By replacing the traditional least squares criterion with the least trimmed squares criterion, in which half of the data is treated as potential outliers, one can fit accurate regression models to strongly contaminated data. High-breakdown methods have become well established in linear regression, but have only recently started to be applied to nonlinear regression. In this work, we examine the problem of fitting artificial neural networks (ANNs) to contaminated data using the least trimmed squares criterion. We introduce a penalized least trimmed squares criterion which prevents unnecessary removal of valid data. Training of ANNs leads to a challenging non-smooth global optimization problem. We compare the efficiency of several derivative-free optimization methods in solving it, and show that our approach identifies the outliers correctly when ANNs are used for nonlinear regression.
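
The difference between the two criteria named in this abstract can be made concrete with a minimal sketch. The function names and the coverage parameter `h` are hypothetical and chosen here for illustration; this is not the report's code, and its penalized variant is not reproduced because the abstract does not define it.

```python
def least_squares_objective(residuals):
    """Ordinary least squares criterion: sum of all squared residuals."""
    return sum(r * r for r in residuals)


def least_trimmed_squares_objective(residuals, h):
    """Least trimmed squares criterion: sum of the h smallest squared residuals.

    The remaining len(residuals) - h observations are treated as potential
    outliers and contribute nothing to the objective.
    """
    if not 1 <= h <= len(residuals):
        raise ValueError("h must be between 1 and the number of residuals")
    squared = sorted(r * r for r in residuals)
    return sum(squared[:h])


if __name__ == "__main__":
    # Three well-fitted points and one gross outlier (illustrative values).
    residuals = [1, -2, 3, 100]
    h = len(residuals) // 2 + 1  # keep roughly half of the data
    print(least_squares_objective(residuals))
    print(least_trimmed_squares_objective(residuals, h))
```

With these illustrative residuals the single outlier dominates the least squares value, while the trimmed criterion ignores it, which is why a fit that minimizes the trimmed objective is not pulled toward the contaminated point.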

arXiv.org e-Print Archive

Deakin Research Online

Crossref

Federation ResearchOnline

Robust Fuzzy Clustering via Trimming and Constraints

Author: A Farcomeni
AK Lenstra
DW Hosmer
E Ruspini
E Trauwaert
H Fritz
H Späth
J Kim
J Łeski
JC Bezdek
KL Wu
LA García-Escudero
PJ Rousseeuw
R Krishnapuram
RJ Hathaway
RN Davé
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

A methodology for robust fuzzy clustering is proposed. This methodology can be applied to a wide range of statistical problems because it is based on probability likelihoods. Robustness is achieved by trimming a fixed proportion of the “most outlying” observations, which are self-determined by the data set at hand. Constraints on the clusters’ scatters are also needed to obtain mathematically well-defined problems and to avoid the detection of non-interesting spurious clusters. The main lines for computationally feasible algorithms are provided, and some simple guidelines on how to choose tuning parameters are briefly outlined. The proposed methodology is illustrated through two applications. The first is aimed at heterogeneous clustering under multivariate normal assumptions, and the second might be useful in fuzzy clusterwise linear regression problems. Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13).

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Documental de la Universidad de Valladolid

Archivio della Ricerca - Università di Roma 3

Combining semantic web technologies with evolving fuzzy classifier eClass for EHR-based phenotyping : a feasibility study

Author: AM Kaplan
F Skopik
G Salton
PJ Rousseeuw
S Spangler
WG Mangold
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

In parallel with nationwide efforts to set up shared electronic health records (EHRs) across healthcare settings, several large-scale national and international projects are developing, validating, and deploying EHR-oriented phenotype algorithms that aim at large-scale use of EHR data for genomic studies. A current bottleneck in using EHR data to obtain computable phenotypes is transforming the raw EHR data into clinically relevant features. The research study presented here proposes a novel combination of Semantic Web technologies with the on-line evolving fuzzy classifier eClass to obtain and validate EHR-driven computable phenotypes derived from 1956 clinical statements from EHRs. The evaluation performed with clinicians demonstrates the feasibility and practical acceptability of the proposed approach.

University of Liverpool Repository

University of Salford Institutional Repository

Crossref

The University of Manchester - Institutional Repository

Dynamic clustering of time series with Echo State Networks

Author: A Saxena
AK Jain
D Verstraeten
I Goodfellow
M Lukoševičius
M Velázquez-Mariño
PJ Rousseeuw
RA Becerra-García
S Aghabozorgi
S Ding
T Kohonen
Publication date: 05/06/2019

In this paper we introduce a novel methodology for unsupervised analysis of time series, based on the iterative implementation of a clustering algorithm embedded into the evolution of a recurrent Echo State Network. The main features of the temporal data are captured by the dynamical evolution of the network states, which are then subjected to a clustering procedure. We apply the proposed algorithm to time series from records of eye movements, called saccades, which are recorded for the diagnosis of a neurodegenerative form of ataxia. This is a hard classification problem, since saccades from patients at an early stage of the disease are practically indistinguishable from those of healthy subjects. The unsupervised clustering algorithm embedded within the recurrent network produces more compact clusters than conventional clustering of static data, and provides a source of information that could aid diagnosis and assessment of the disease. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.

Crossref

Repositorio Institucional Universidad de Málaga

Sparse Robust Regression for Explaining Classifiers

Author: A Alfons
A Henelius
D Baehrens
E Amaldi
E Smucler
G Ausiello
H Mobahi
H Wang
PJ Rousseeuw
PT Komiske
R Guidotti
R Tibshirani
VJ Yohai
Publication venue: Springer Nature Switzerland
Publication date: 28/10/2019

Recipient of the best student paper award. Peer reviewed.

Crossref

Helsingin yliopiston digitaalinen arkisto

Towards Spatial Word Embeddings

Author: DM Powers
J Han
M Imran
N Craswell
PJ Rousseeuw
SE Robertson
Y Fang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2019

Leveraging the textual and spatial data provided in spatio-textual objects (e.g., tweets) has become increasingly important in real-world applications, favoured by their growing availability in recent decades (e.g., through smartphones). In this paper, we propose a spatial retrofitting method for word embeddings that can reveal the localised similarity of word pairs as well as the diversity of their localised meanings. Experiments based on the semantic location prediction task show that our method achieves significant improvement over strong baselines.

Crossref

Open Archive Toulouse Archive Ouverte